Right now I am feeling great because I have started something really new. I knew that R existed and that we could do a lot of statistical analyses and data visualization with it, but working with R and GitHub together is amazing. From this course I expect to expand my R basics with some new data visualization techniques. I heard about this course from a colleague.
A link to my GitHub repository: https://github.com/Hem-web/IODS-project
setwd("~/IODS-project")
learning2014 <- read.table("~/IODS-project/data/learning2014.txt")
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
dim(learning2014)
## [1] 166 7
pairs(learning2014[!names(learning2014) %in% "gender"], col = learning2014$gender)
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
ggpairs(learning2014,
mapping = aes(col = gender, alpha = 0.3),
lower = list(combo = wrap("facethist", bins = 20)))
qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")
We fit a regression model with all five explanatory variables (attitude, surf, stra, age, and deep) to identify which of them has the strongest effect on points.
my_model <- lm(points ~ attitude + surf + deep + stra + age, data = learning2014)
results <- summary(my_model)
knitr::kable(results$coefficients, digits=3, caption="Regression coefficients")
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 18.301 | 5.241 | 3.492 | 0.001 |
| attitude | 3.432 | 0.570 | 6.022 | 0.000 |
| surf | -1.105 | 0.840 | -1.316 | 0.190 |
| deep | -1.044 | 0.782 | -1.335 | 0.184 |
| stra | 0.966 | 0.539 | 1.791 | 0.075 |
| age | -0.097 | 0.053 | -1.809 | 0.072 |
After fitting the model with five explanatory variables, we observe that the strongest effect on exam points comes from attitude towards statistics. Next, let's remove stra, age, and deep and refit the model to see whether keeping surf changes anything.
my_model1 <- lm(points ~ attitude + surf, data = learning2014)
results <- summary(my_model1)
knitr::kable(results$coefficients, digits=3, caption="Regression coefficients")
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 14.120 | 3.127 | 4.515 | 0.000 |
| attitude | 3.426 | 0.576 | 5.944 | 0.000 |
| surf | -0.779 | 0.796 | -0.979 | 0.329 |
It seems that even after removing those three variables, the coefficient of the surface learning approach hardly changed and remained statistically insignificant, suggesting that attitude is the only explanatory variable with a substantial impact on points. So, let's fit the model once more with attitude as the only explanatory variable.
my_model3 <- lm(points ~ attitude, data = learning2014)
results <- summary(my_model3)
knitr::kable(results$coefficients, digits=3, caption="Regression coefficients")
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 11.637 | 1.830 | 6.358 | 0 |
| attitude | 3.525 | 0.567 | 6.214 | 0 |
The final model summary suggests that both the intercept and attitude are statistically significant. The attitude estimate of 3.525 means that each one-unit increase in attitude is associated with 3.525 additional exam points, while the intercept of 11.637 means that a student with an attitude score of zero would still be expected to score 11.637 points. The coefficient of determination \(R^2\) = 0.1905537 indicates that attitude explains only about 19% of the variation in points, suggesting that other explanatory variables also influence the score.
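As a sanity check on this interpretation, the fitted line can be applied by hand (coefficient values copied from the table above; a sketch, not a re-fit):

```r
# coefficients of the final model, copied from the table above
intercept <- 11.637
slope <- 3.525

# expected exam points for a student with attitude = 3
intercept + slope * 3 # 22.212

# in simple linear regression R^2 is the squared correlation of x and y,
# so R^2 = 0.19 corresponds to a correlation of about 0.44
sqrt(0.1905537)
```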
plot(my_model3, which=c(1,2,5))
setwd("~/IODS-project")
alc <- read.table("~/IODS-project/data/pormath.txt")
names(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
str(alc)
## 'data.frame': 382 obs. of 35 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 2 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : int 2 7 10 14 8 14 12 8 16 13 ...
## $ G2 : int 8 8 10 14 12 14 12 9 17 14 ...
## $ G3 : int 8 8 11 14 12 14 12 10 18 14 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
There are many interesting variables in the data that could be strongly linked to alcohol consumption. However, I find the following variables quite interesting and propose a personal hypothesis for each of them with regard to alcohol consumption.
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# For a graphical overview I will draw box plots of the chosen variables against high alcohol use. Let's run the following script:
g1 <- ggplot(alc, aes(x = high_use, y = G1))
g1 + geom_boxplot() + ylab("G1")
g2 <- ggplot(alc, aes(x = high_use, y = freetime))
g2 + geom_boxplot() + ylab("freetime")
g3 <- ggplot(alc, aes(x = high_use, y = studytime))
g3 + geom_boxplot() + ylab("studytime")
g3 <- ggplot(alc, aes(x = high_use, y = absences))
g3 + geom_boxplot() + ylab("absences")
m <- glm(high_use ~ G1 + freetime + studytime + absences, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ G1 + freetime + studytime + absences,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.885 -0.828 -0.634 1.110 2.129
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.49537 0.72601 -0.682 0.495036
## G1 -0.08391 0.04591 -1.828 0.067614 .
## freetime 0.33103 0.12468 2.655 0.007929 **
## studytime -0.45144 0.15958 -2.829 0.004671 **
## absences 0.07957 0.02245 3.544 0.000395 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 423.08 on 377 degrees of freedom
## AIC: 433.08
##
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept) G1 freetime studytime absences
## -0.49536973 -0.08390687 0.33102562 -0.45144113 0.07956876
OR <- coef(m) %>% exp
CI <- confint(m)
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.6093456 -1.92821285 0.925191583
## G1 0.9195169 -0.17508225 0.005453013
## freetime 1.3923955 0.08987902 0.579750936
## studytime 0.6367099 -0.77323579 -0.146089296
## absences 1.0828200 0.03763152 0.126082337
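Note that `confint()` returns intervals on the log-odds scale, which is why the bounds above can be negative; to read them as odds-ratio bounds they must be exponentiated as well. A minimal sketch using the G1 row above:

```r
# log-odds confidence bounds for G1, copied from the output above
log_ci <- c(lower = -0.17508225, upper = 0.005453013)

# exponentiating maps the bounds onto the odds-ratio scale; the interval
# contains 1, matching the non-significant G1 coefficient
exp(log_ci)
```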
# fit the model
m <- glm(high_use ~ G1 + freetime + absences, data = alc, family = "binomial")
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# make a prediction of high_use (note: this line copies high_use itself rather
# than thresholding the probabilities at 0.5, so the cross-tabulation below is
# trivially perfect; the intended rule is prediction = probability > 0.5)
alc <- mutate(alc, prediction = high_use)
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = high_use, y = probability))
# define the geom as points and draw the plot
g + geom_point()
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 268 0
## TRUE 0 114
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# training error of the naive guess prob = 0 (always predicting FALSE),
# which equals the proportion of high users in the data
loss_func(class = alc$high_use, prob = 0)
## [1] 0.2984293
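To see what the loss function measures, here is a toy call where two of four predictions fall on the wrong side of 0.5 (illustrative values only):

```r
# same loss function as above: share of predictions more than 0.5 away
# from the true class
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# TRUE/FALSE targets are coerced to 1/0; predictions 0.2 and 0.7 are wrong
loss_func(class = c(TRUE, TRUE, FALSE, FALSE),
          prob = c(0.9, 0.2, 0.1, 0.7)) # 0.5
```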
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
## [1] 506 14
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
## [1] "matrix"
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
## crime
## low med_low med_high high
## 127 126 126 127
## [1] 506 14
## 'data.frame': 506 obs. of 14 variables:
## $ zn : num 0.285 -0.487 -0.487 -0.487 -0.487 ...
## $ indus : num -1.287 -0.593 -0.593 -1.306 -1.306 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.144 -0.74 -0.74 -0.834 -0.834 ...
## $ rm : num 0.413 0.194 1.281 1.015 1.227 ...
## $ age : num -0.12 0.367 -0.266 -0.809 -0.511 ...
## $ dis : num 0.14 0.557 0.557 1.077 1.077 ...
## $ rad : num -0.982 -0.867 -0.867 -0.752 -0.752 ...
## $ tax : num -0.666 -0.986 -0.986 -1.105 -1.105 ...
## $ ptratio: num -1.458 -0.303 -0.303 0.113 0.113 ...
## $ black : num 0.441 0.441 0.396 0.416 0.441 ...
## $ lstat : num -1.074 -0.492 -1.208 -1.36 -1.025 ...
## $ medv : num 0.16 -0.101 1.323 1.182 1.486 ...
## $ crime : Factor w/ 4 levels "low","med_low",..: 1 1 1 1 1 1 2 2 2 2 ...
## 'data.frame': 404 obs. of 14 variables:
## $ zn : num -0.487 -0.487 -0.487 -0.487 3.372 ...
## $ indus : num 2.42 -0.376 1.015 -0.376 -1.19 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num 0.469 -0.299 1.599 -0.299 -1.335 ...
## $ rm : num -1.239 0.63 0.49 0.171 1.143 ...
## $ age : num 1.056 0.402 0.925 0.598 -1.697 ...
## $ dis : num -0.969 -0.483 -0.793 -0.513 1.668 ...
## $ rad : num -0.637 -0.522 1.66 -0.522 -0.982 ...
## $ tax : num 1.796 -0.144 1.529 -0.144 -0.731 ...
## $ ptratio: num 0.76 1.129 0.806 1.129 -1.458 ...
## $ black : num -0.138 0.417 -2.704 -3.131 0.417 ...
## $ lstat : num 1.585 -0.453 1.487 -0.283 -0.673 ...
## $ medv : num -1.689 0.54 -0.993 -0.428 1.051 ...
## $ crime : Factor w/ 4 levels "low","med_low",..: 2 2 4 2 1 1 2 4 1 1 ...
## 'data.frame': 102 obs. of 13 variables:
## $ zn : num 0.2845 -0.4872 -0.4872 0.0487 0.0487 ...
## $ indus : num -1.287 -0.593 -1.306 -0.476 -0.476 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.144 -0.74 -0.834 -0.265 -0.265 ...
## $ rm : num 0.413 0.194 1.227 0.131 -0.563 ...
## $ age : num -0.12 0.367 -0.511 0.914 -1.051 ...
## $ dis : num 0.14 0.557 1.077 1.212 0.786 ...
## $ rad : num -0.982 -0.867 -0.752 -0.522 -0.522 ...
## $ tax : num -0.666 -0.986 -1.105 -0.577 -0.577 ...
## $ ptratio: num -1.458 -0.303 0.113 -1.504 -1.504 ...
## $ black : num 0.441 0.441 0.441 0.393 0.371 ...
## $ lstat : num -1.074 -0.492 -1.025 1.092 0.428 ...
## $ medv : num 0.1595 -0.1014 1.486 -0.819 -0.0906 ...
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2524752 0.2450495 0.2475248 0.2549505
##
## Group means:
## zn indus chas nox rm age
## low 0.9920210 -0.9095619 -0.07933396 -0.8861860 0.4286954 -0.8863034
## med_low -0.0807736 -0.3077574 0.04582045 -0.5839326 -0.1021549 -0.3313248
## med_high -0.3749021 0.1861743 0.16075196 0.3942514 0.1792948 0.4541959
## high -0.4872402 1.0170891 -0.04298342 1.0630065 -0.4477067 0.8431199
## dis rad tax ptratio black lstat
## low 0.9099117 -0.6846209 -0.7401168 -0.47578260 0.3736738 -0.77232497
## med_low 0.3709499 -0.5665670 -0.5271433 -0.09657007 0.3227870 -0.15241666
## med_high -0.4064509 -0.4087862 -0.3082238 -0.31018499 0.1378815 -0.03126471
## high -0.8617179 1.6384176 1.5142626 0.78111358 -0.7675264 0.89700635
## medv
## low 0.50543804
## med_low 0.03432369
## med_high 0.21922049
## high -0.67980353
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.144070406 0.709642048 -1.03124436
## indus -0.001331493 -0.220046594 0.26542633
## chas -0.083326910 0.008234877 0.14745167
## nox 0.384199429 -0.760294029 -1.31104864
## rm -0.116280044 -0.130155162 -0.16550316
## age 0.259019433 -0.371105939 -0.17197205
## dis -0.122000757 -0.289294653 0.29932997
## rad 3.108937843 1.070984973 -0.10788412
## tax 0.052007852 -0.177904094 0.55242924
## ptratio 0.106408482 0.045193655 -0.20829231
## black -0.126803966 0.001198025 0.08123095
## lstat 0.204259436 -0.181864168 0.60791505
## medv 0.164011835 -0.375127929 0.05379806
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9472 0.0396 0.0133
## [1] low med_low med_low med_low med_low med_low med_low med_low
## [9] med_low med_low med_low low low low med_low med_low
## [17] low med_low med_low med_low med_low med_low low med_low
## [25] med_high med_low med_low med_high med_low med_high med_high med_high
## [33] med_high med_high med_high med_high med_high med_high med_high med_low
## [41] med_low med_low low low low med_low med_low med_low
## [49] med_low med_high med_low med_low med_low med_low low med_low
## [57] med_high low low low low low med_high med_low
## [65] med_low med_low med_low med_low low med_low low high
## [73] high high high high high high high high
## [81] high high high high high high high high
## [89] high high high high high high high high
## [97] med_high med_high med_high med_high med_high med_high
## Levels: low med_low med_high high
library(MASS)
data('Boston')
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
class(boston_scaled)
## [1] "matrix"
boston_scaled <- as.data.frame(boston_scaled)
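The `scale()` call above standardizes each column to mean 0 and standard deviation 1 and returns a matrix, which is why the `as.data.frame()` conversion is needed. A quick check on toy data:

```r
x <- c(2, 4, 6, 8)
# scale() subtracts the column mean and divides by the column sd
xs <- as.vector(scale(x))
all.equal(xs, (x - mean(x)) / sd(x)) # TRUE
```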
dist_eu <- dist(boston_scaled) # euclidean distances (the default method)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
boston_scaled <- as.matrix(boston_scaled)
# note: dist() defaults to euclidean; method = "manhattan" should be passed to
# dist(), not as.matrix(), which is why this summary repeats the one above
dist_man <- dist(boston_scaled)
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
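A toy check of the `method` argument: the summaries above are identical because `dist()` computes euclidean distances unless `method = "manhattan"` is passed.

```r
# two points: (0, 0) and (3, 4)
pts <- rbind(c(0, 0), c(3, 4))
as.numeric(dist(pts))                       # euclidean: sqrt(3^2 + 4^2) = 5
as.numeric(dist(pts, method = "manhattan")) # manhattan: 3 + 4 = 7
```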
# k-means clustering
km <- kmeans(Boston, centers = 4)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
pairs(Boston[6:10], col = km$cluster)
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
km <- kmeans(Boston, centers = 2)
pairs(Boston, col = km$cluster)
An abrupt drop in the total within-cluster sum of squares (WCSS) happens when the number of clusters (x-axis) reaches about 2, suggesting that the optimal number of clusters is most likely 2 or fewer. The WCSS barely changes as the number of clusters increases further towards 10, which also suggests that results with 2 or fewer clusters would be similar.
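The elbow can also be read off numerically from the WCSS vector (toy numbers below, not the actual Boston values):

```r
# hypothetical total WCSS for k = 1..5 clusters
twcss_toy <- c(100, 40, 35, 33, 32)

# relative drop in WCSS gained by each additional cluster
drops <- -diff(twcss_toy) / head(twcss_toy, -1)
round(drops, 3) # 0.600 0.125 0.057 0.030: the big gain stops after k = 2
```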
library(MASS)
data(Boston) # load the Boston data set
boston_scaled1 <- scale(Boston) # scale the Boston data set again, named boston_scaled1
boston_scaled2 <- as.data.frame(boston_scaled1)
km1 <- kmeans(Boston, centers = 3) # k-means clustering
cluster <- km1$cluster
boston_scaled2 <- data.frame(boston_scaled1, cluster) # add the cluster number to the df
lda.fit <- lda(cluster ~ ., data = boston_scaled2) # linear discriminant analysis of clusters vs. all other variables
lda.fit # print the lda.fit object
## Call:
## lda(cluster ~ ., data = boston_scaled2)
##
## Prior probabilities of groups:
## 1 2 3
## 0.5296443 0.1996047 0.2707510
##
## Group means:
## crim zn indus chas nox rm
## 1 -0.3920779 0.27670879 -0.6513071 0.0214843827 -0.6152775 0.2573427
## 2 -0.3293317 -0.07332724 0.2818828 0.0005392655 0.2816899 -0.1453417
## 3 1.0097765 -0.48724019 1.0662784 -0.0424254043 0.9959393 -0.3962652
## age dis rad tax ptratio black
## 1 -0.4572006 0.5121870 -0.6013344 -0.78136288 -0.2690134 0.34109296
## 2 0.1822823 -0.2378455 -0.5418150 -0.01444889 -0.3768823 0.07010933
## 3 0.7599946 -0.8265965 1.5757732 1.53915759 0.8040926 -0.71893398
## lstat medv
## 1 -0.43621538 0.36234147
## 2 0.01371321 -0.03812375
## 3 0.84321670 -0.68070813
##
## Coefficients of linear discriminants:
## LD1 LD2
## crim 0.048210477 0.05079118
## zn 0.253528315 0.06311589
## indus 0.369497254 0.12674727
## chas -0.047064817 0.01998369
## nox -0.063156250 -0.49621758
## rm -0.005144383 0.09537352
## age -0.118710969 0.05412142
## dis -0.385151599 0.17969944
## rad 1.996321584 3.05733525
## tax 4.535785039 -2.77688761
## ptratio 0.122064688 0.19196217
## black -0.029200518 0.06353722
## lstat 0.085030308 0.12666624
## medv 0.157444662 -0.10356584
##
## Proportion of trace:
## LD1 LD2
## 0.9812 0.0188
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1, 2)) {
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
classes1 <- as.numeric(boston_scaled2$cluster) # target classes as numeric
plot(lda.fit, dimen = 2, col = classes1, pch = classes1,
     main = "LDA biplot using three clusters 1, 2 and 3") # plot the LDA results
lda.arrows(lda.fit, myscale = 2)
knitr::opts_chunk$set(echo = FALSE)
setwd("~/IODS-project")
human <- read.table("~/IODS-project/data/human.txt")
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ secedu.R : num 1.007 0.997 0.983 0.989 0.969 ...
## $ labour.ratio: num 0.891 0.819 0.825 0.884 0.829 ...
## $ expect.edu : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ life.expB : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI.capita : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ MMR : int 4 6 6 5 6 7 9 28 11 8 ...
## $ adols.BR : num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ par.percent : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
dim(human)
## [1] 155 8
#Read the required packages
library(MASS)
library(dplyr)
library(ggplot2)
library(GGally)
## secedu.R labour.ratio expect.edu life.expB
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI.capita MMR adols.BR par.percent
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
## secedu.R labour.ratio expect.edu life.expB
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7378 Min. :-2.7188
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6782 1st Qu.:-0.6425
## Median : 0.3503 Median : 0.2316 Median : 0.1140 Median : 0.3056
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.7126 3rd Qu.: 0.6717
## Max. : 2.6646 Max. : 1.6632 Max. : 2.4730 Max. : 1.4218
## GNI.capita MMR adols.BR par.percent
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
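The percentages printed first (53.6, 16.2, ...) are the proportions of variance from the summary, rounded to one decimal; they come from the squared component standard deviations. A sketch on toy data (the actual analysis used the standardized human data):

```r
set.seed(1)
pca <- prcomp(matrix(rnorm(200), ncol = 4), scale. = TRUE)

# percentage of total variance captured by each principal component
pct <- round(100 * pca$sdev^2 / sum(pca$sdev^2), 1)
pct # sums to roughly 100, in decreasing order
```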
## [1] 300 36
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
The tea data contains 300 observations of 36 variables. Almost all variables are two-level factors describing when, where, and how tea is drunk and what it is associated with; the exceptions are age (an integer) and a few multi-level factors such as Tea, How, how, where, price, SPC, age_Q, and frequency.